Segment
Create a data visualisation that segments the drinks in the "kids drinks and other" category by nutrition indicators, using starbucks_drink.csv.
The data set also contains other categories of drinks, which need to be filtered out.
Drinks come in different portion sizes, so secondary measures (per fluid ounce) need to be derived for the nutrition indicators to allow like-for-like comparison among drinks of different portions.
Different drinks have the same name under the “Name” variable but differ in “Size”, “Milk” and “Whipped Cream”. Hence, there is a need to assign unique names to each record.
Variables may display collinearity and hence there is a need to identify and eliminate collinear variables before performing clustering.
The number of segments is unknown and hence there is a need to optimise the number of segments.
packages = c('seriation', 'dendextend', 'heatmaply', 'tidyverse', 'GGally', 'factoextra', 'NbClust')
# Install any package that is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
all_data <- read.csv("data/starbucks_drink.csv")
kids_drink <- filter(all_data, Category == "kids-drinks-and-other")
kids_drink$UniqueName <- paste(kids_drink$Name, kids_drink$Size, kids_drink$Milk, kids_drink$Whipped.Cream)
kids_drink$Calories.poz <- kids_drink$Calories / kids_drink$Portion.fl.oz.
kids_drink$Calories.fat.poz <- kids_drink$Calories.from.fat / kids_drink$Portion.fl.oz.
kids_drink$Total.Fat.g.poz <- kids_drink$Total.Fat.g. / kids_drink$Portion.fl.oz.
kids_drink$Saturated.fat.g.poz <- kids_drink$Saturated.fat.g. / kids_drink$Portion.fl.oz.
kids_drink$Trans.fat.g.poz <- kids_drink$Trans.fat.g. / kids_drink$Portion.fl.oz.
kids_drink$Cholesterol.mg.poz <- kids_drink$Cholesterol.mg. / kids_drink$Portion.fl.oz.
kids_drink$Sodium.mg.poz <- kids_drink$Sodium.mg. / kids_drink$Portion.fl.oz.
kids_drink$Total.Carbohydrate.g.poz <- kids_drink$Total.Carbohydrate.g. / kids_drink$Portion.fl.oz.
kids_drink$Dietary.Fiber.g.poz <- kids_drink$Dietary.Fiber.g. / kids_drink$Portion.fl.oz.
kids_drink$Sugars.g.poz <- kids_drink$Sugars.g. / kids_drink$Portion.fl.oz.
kids_drink$Protein.g.poz <- kids_drink$Protein.g. / kids_drink$Portion.fl.oz.
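# Convert via character first so that a factor column yields its values rather than its level codes
kids_drink$Caffeine.mg. <- as.numeric(as.character(kids_drink$Caffeine.mg.))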
kids_drink$Caffeine.mg.poz <- kids_drink$Caffeine.mg. / kids_drink$Portion.fl.oz.
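The per-ounce measures above all follow the same pattern, so they could also be derived in a loop. The following is a minimal sketch on made-up toy data (the data frame `toy` and the vector `measures` are illustrative names, not part of the original analysis); it reproduces the `.poz` naming used above by dropping the trailing dot before adding the suffix.

```r
# Toy data standing in for the filtered records (values are made up).
toy <- data.frame(
  Portion.fl.oz. = c(8, 12, 16),
  Calories       = c(80, 180, 320),
  Sugars.g.      = c(8, 24, 40)
)

# Columns to convert to a per-fluid-ounce basis.
measures <- c("Calories", "Sugars.g.")

for (m in measures) {
  # "Sugars.g." -> "Sugars.g.poz", "Calories" -> "Calories.poz"
  new_name <- paste0(sub("\\.$", "", m), ".poz")
  toy[[new_name]] <- toy[[m]] / toy$Portion.fl.oz.
}

print(toy)
```

Listing the measures in one vector keeps the derivation in a single place, which may make it easier to add or drop an indicator later.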
ggpairs(kids_drink[,20:31], upper = list(continuous = wrap("cor", size = 3)))
The following strong correlations were observed:
Calories per oz and Calories from fat per oz are highly correlated, hence Calories from fat per oz could be removed from the analysis.
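Total Fat per oz and Saturated Fat per oz are highly correlated. Since Total Fat per oz is the sum of its constituent types of fat, Total Fat per oz can be removed from the analysis and replaced by a new variable for non-saturated, non-trans fat per oz. The code below applies the same approach to carbohydrates: Total Carbohydrate per oz is replaced by a variable for non-sugar, non-fibre carbohydrate per oz.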
kids_drink$NonSatNonTrans.g.poz <- kids_drink$Total.Fat.g.poz - kids_drink$Saturated.fat.g.poz - kids_drink$Trans.fat.g.poz
kids_drink$NonSugarNonFibreCarb.g.poz <- kids_drink$Total.Carbohydrate.g.poz - kids_drink$Sugars.g.poz - kids_drink$Dietary.Fiber.g.poz
row.names(kids_drink) <- kids_drink$UniqueName
kids_drink_matrix <- dplyr::select(kids_drink, c(20,23:26,28:33))
kids_drink_matrix <- data.matrix(kids_drink_matrix)
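Before clustering, the remaining pairwise correlations can also be checked numerically rather than read off the pairs plot. The following is a minimal sketch on simulated data, not the Starbucks records; the 0.8 cut-off and the names `toy`, `cors` and `flagged` are assumptions chosen for illustration.

```r
# Simulated matrix: column b is almost a multiple of a, column c is independent.
set.seed(1)
a <- rnorm(50)
toy <- cbind(a = a, b = 2 * a + rnorm(50, sd = 0.05), c = rnorm(50))

# Absolute pairwise correlations; blank the lower triangle and diagonal
# so that each pair is listed only once.
cors <- abs(cor(toy))
cors[lower.tri(cors, diag = TRUE)] <- NA

# Flag pairs above the assumed cut-off of 0.8.
high <- which(cors > 0.8, arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
print(flagged)
```

The same steps could be applied to `kids_drink_matrix` to confirm that no strongly correlated pair is left after the column selection.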
Use the elbow method to determine the optimal number of clusters.
# Elbow method
fviz_nbclust(kids_drink_matrix, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow method")
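The elbow plot shows the total within-cluster sum of squares (WSS) for each candidate number of clusters; the "elbow" is where adding another cluster stops reducing WSS by much. As a rough illustration of what is being plotted, the sketch below computes WSS with base R `kmeans` on simulated data with three well-separated groups (the data and the range of k are assumptions, not taken from the analysis).

```r
set.seed(42)
# Simulated data: three well-separated 2-D blobs of 20 points each.
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
# nstart = 20 reruns k-means from several random starts and keeps the best fit.
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

print(round(wss, 1))
```

On data like this, WSS drops sharply up to k = 3 and flattens afterwards, which is the pattern to look for in the plot for the drinks data.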
heatmaply(normalize(kids_drink_matrix),
          Colv = NA,
          seriate = "none",
          colors = Blues,
          k_row = 4,
          margins = c(NA, 200, 60, NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main = "Nutrition Indicators by Drink \nData Transformation using Normalise Method",
          xlab = "Nutrition Indicators",
          ylab = "Kids Drink & Others"
          )
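The heatmap call combines two steps: rescaling each indicator to a common 0 to 1 range, and cutting a hierarchical clustering of the rows into `k_row` groups. The sketch below imitates both steps in base R on a made-up matrix. It assumes min-max rescaling and complete-linkage clustering on Euclidean distances; the exact defaults used by heatmaply should be checked against its documentation. The helper `rescale01` and the toy values are hypothetical.

```r
# Made-up matrix: 6 drinks, 2 indicators on very different scales.
toy <- matrix(
  c(10, 12, 50, 52, 90, 95,
    0.1, 0.2, 0.5, 0.6, 0.9, 1.0),
  ncol = 2,
  dimnames = list(paste0("drink", 1:6), c("cal.poz", "sugar.poz"))
)

# Min-max rescale a vector to the range [0, 1] (hypothetical helper).
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
toy_norm <- apply(toy, 2, rescale01)

# Hierarchical clustering of the rows, then cut the tree into 3 groups.
hc <- hclust(dist(toy_norm), method = "complete")
groups <- cutree(hc, k = 3)

print(groups)
```

Without the rescaling step, the indicator with the largest numeric range would dominate the distances, which is why the matrix is normalised before it is clustered.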
There were 4 clusters obtained: